Abstract: Communication plays an important role in the day-to-day activities of human beings. The main
objective of this paper is to help people who are unable to speak. Visual speech plays a major role in
lip-reading for deaf listeners. Here we use local spatiotemporal descriptors to identify or recognize the
words or phrases spoken by speech-disabled people from their lip movements. Local binary patterns
extracted from the lip movements are used to recognize the isolated phrases. Experiments were made on
ten speakers with 5 phrases and obtained an accuracy of about 75% for speaker-dependent and 65% for
speaker-independent recognition. Compared with other work, such as that on the AVLetters database,
our method outperforms the others with an accuracy of 65%. The advantages of our method are its
robustness and that recognition is possible in real time.

Key words: LBP-TOP, Webcam, Visual speech, KNN classifier, Silent speech interface.

I. Introduction


In recent years, the silent speech interface has become an alternative in the field of speech processing.
Voice-based speech communication suffers from several challenges, which arise from the fact that speech
must be clearly audible and cannot be masked: a lack of robustness, privacy issues, and the exclusion of
speech-disabled persons. These challenges may be overcome by the use of a silent speech interface. A silent
speech interface is a system that enables speech communication to take place without the necessity of a
voice signal, or when an audible acoustic signal is unavailable. In our approach, lip movements are
captured by a webcam, which is placed in front of the lips. Much research focuses on using visual
information only to enhance speech recognition [1]. Audio information still plays a greater role than
visual features. However, in some cases it is very difficult to extract that information. In our method we
concentrate on lip movement representations for speech recognition solely with visual information.

Extraction of a set of visual observation vectors is the key element of an AVSR (audio-visual speech
recognition) system. Geometric features, combined features, and appearance features are mainly used for
representing visual information. Geometric feature methods represent facial animation parameters such as
lip movement, shape of the jaw, and width of the mouth. These methods require accurate and reliable facial
feature detection, which is difficult in practice and impossible at low image resolution.

In this paper, we propose an approach to lip reading from lip movements, with which human-computer
interaction improves significantly and understanding in noisy environments also improves.



II. Local Spatiotemporal Descriptors for Visual Information 



In this paper, we concentrate on LBP-TOP to extract features from the captured video frames. The Local
Binary Pattern (LBP) operator is a simple, gray-scale invariant texture statistic that has shown excellent
performance in the classification of different kinds of texture. For each pixel in an image, a binary code
is produced by thresholding its neighborhood with the value of the center pixel, as shown in
Figure 1(a) [1].
Figure 1. (a) Basic LBP operator, (b) Circular (8,2) neighborhood



\[
\mathrm{LBP}_{P,R} = \sum_{p=0}^{P-1} s(g_p - g_c)\, 2^p, \qquad
s(x) = \begin{cases} 1, & x \ge 0 \\ 0, & x < 0 \end{cases}
\]



where $g_c$ denotes the grey value of the center pixel $(x_c, y_c)$ of the local neighborhood and $g_p$
denotes the grey values of $P$ equally spaced pixels on a circle of radius $R$. A histogram is created to
collect the occurrences of the various binary patterns. The definition of the neighborhood can be extended
to circular neighborhoods with any number of pixels, as shown in Figure 1(b). In this way, we can collect
larger-scale texture primitives or micro-patterns, like spots, lines and corners.
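The basic operator can be illustrated with a short sketch. It assumes the simplest case of 8 neighbors at radius 1 on a 3x3 patch, represents the image as a nested Python list, and fixes one particular neighbor ordering; the function names `lbp_code` and `lbp_histogram` are illustrative and do not come from the paper.

```python
def lbp_code(image, x, y):
    """Return the basic 8-neighbor LBP code of pixel (x, y) in a 2-D list."""
    center = image[y][x]
    # Neighbor offsets (dx, dy) in a fixed order; neighbor p contributes 2**p.
    offsets = [(-1, -1), (0, -1), (1, -1), (1, 0),
               (1, 1), (0, 1), (-1, 1), (-1, 0)]
    code = 0
    for p, (dx, dy) in enumerate(offsets):
        # s(g_p - g_c) is 1 when the neighbor is at least as bright as the center.
        if image[y + dy][x + dx] >= center:
            code += 1 << p
    return code


def lbp_histogram(image):
    """Count how often each of the 256 codes occurs over the interior pixels."""
    hist = [0] * 256
    for y in range(1, len(image) - 1):
        for x in range(1, len(image[0]) - 1):
            hist[lbp_code(image, x, y)] += 1
    return hist


patch = [[6, 5, 2],
         [7, 6, 1],
         [9, 8, 7]]
print(lbp_code(patch, 1, 1))  # prints 241
```

Because only the sign of each difference is kept, adding a constant to every grey value of the patch leaves the code unchanged, which is the gray-scale invariance mentioned above.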







Local texture descriptors have received considerable attention in the analysis of facial images because of
their robustness to challenges such as pose and illumination changes. In our approach, temporal texture
recognition uses local binary patterns extracted from three orthogonal planes (LBP-TOP). The LBP-TOP
method is more efficient than ordinary LBP. In ordinary LBP we extract features in two dimensions, whereas
in LBP-TOP we extract features in three dimensions, i.e., X, Y and T. For LBP-TOP, the radii along the
spatial and temporal axes X, Y, and T, and the numbers of neighboring points in the XY, XT, and YT planes
can also be different; they are denoted $R_X$, $R_Y$, $R_T$ and $P_{XY}$, $P_{XT}$, $P_{YT}$. The LBP-TOP
feature is then denoted $\mathrm{LBP\text{-}TOP}_{P_{XY},P_{XT},P_{YT},R_X,R_Y,R_T}$. If the coordinates of
the center pixel $g_{t_c,c}$ are $(x_c, y_c, t_c)$, then the coordinates of the local neighborhood in the
XY plane, $g_{XY,p}$, are given by $(x_c - R_X \sin(2\pi p/P_{XY}),\ y_c + R_Y \cos(2\pi p/P_{XY}),\ t_c)$,
the coordinates of the local neighborhood in the XT plane, $g_{XT,p}$, are given by
$(x_c - R_X \sin(2\pi p/P_{XT}),\ y_c,\ t_c - R_T \cos(2\pi p/P_{XT}))$, and the coordinates of the local
neighborhood in the YT plane, $g_{YT,p}$, are given by
$(x_c,\ y_c - R_Y \cos(2\pi p/P_{YT}),\ t_c - R_T \sin(2\pi p/P_{YT}))$. Sometimes the radii along the three
axes are the same, and so is the number of neighboring points in the XY, XT, and YT planes. In that case,
we use $\mathrm{LBP\text{-}TOP}_{P,R}$ as an abbreviation, where $P = P_{XY} = P_{XT} = P_{YT}$ and
$R = R_X = R_Y = R_T$ [1].
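As a minimal sketch of the three neighborhood formulas, the function below evaluates the sampling coordinates around one center pixel. The function name `top_neighbors` and the default parameter values are assumptions made for illustration only.

```python
import math


def top_neighbors(xc, yc, tc, radii=(1, 1, 1), points=(8, 8, 8)):
    """Sampling coordinates of the XY, XT and YT neighborhoods of (xc, yc, tc).

    radii  = (R_X, R_Y, R_T)
    points = (P_XY, P_XT, P_YT)
    """
    rx, ry, rt = radii
    pxy, pxt, pyt = points
    xy = [(xc - rx * math.sin(2 * math.pi * p / pxy),
           yc + ry * math.cos(2 * math.pi * p / pxy),
           tc) for p in range(pxy)]
    xt = [(xc - rx * math.sin(2 * math.pi * p / pxt),
           yc,
           tc - rt * math.cos(2 * math.pi * p / pxt)) for p in range(pxt)]
    yt = [(xc,
           yc - ry * math.cos(2 * math.pi * p / pyt),
           tc - rt * math.sin(2 * math.pi * p / pyt)) for p in range(pyt)]
    return xy, xt, yt


xy, xt, yt = top_neighbors(10, 20, 5, radii=(3, 3, 1), points=(8, 8, 8))
print(xy[0], xt[0], yt[0])
```

Most of these coordinates are not integers, so an implementation usually estimates the grey value at each sampling point by interpolating between the surrounding pixels; that step is left out of this sketch.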




Figure 2. (a) Volume of utterance sequence, (b) Image in XY plane, (c) Image in XT plane, (d) Image in YT plane.
Figure 2(a) demonstrates the volume of an utterance sequence. Figure 2(b) shows an image in the XY plane.
Figure 2(c) is an image in the XT plane, providing a visual impression of one row changing in time, while
Figure 2(d) describes the motion of one column in temporal space [1]. An LBP description computed over the
whole utterance sequence encodes only the occurrences of the micro-patterns, without any indication of
their locations. To overcome this effect, a representation that divides the mouth image into several
overlapping blocks is introduced. Figure 3 gives some examples of the LBP images. The second, third, and
fourth rows show the LBP images drawn using the LBP code of every pixel from the XY (second row), XT
(third row), and YT (fourth row) planes, respectively, corresponding to the mouth images in the first row.




Figure 3. Mouth region images (first row), LBP-XY images (second row), LBP-XT images (third row), and
LBP-YT images (last row) from one utterance [1].

Figure 4. Features in each block volume. (a) Block volumes, (b) LBP features from three orthogonal planes,
(c) Concatenated features for one block volume with the appearance and motion [1].



Figure 5. Mouth movement representation: mouth movement features from the whole sequence [1].



When a person utters a command phrase, the words are pronounced in order, for instance "you-see" or 
"see-you". If we do not consider the time order, these two phrases would generate almost the same features. 
To overcome this effect, the whole sequence is divided into block volumes not only according to spatial
regions but also in time order, as shown in Figure 4(a). The LBP-TOP histograms in each block volume are
computed and concatenated into a single histogram, as shown in Figure 4. All features extracted from each
block volume are connected to represent the appearance and motion of the mouth region sequence, as shown
in Figure 5.

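A histogram of the mouth movements can be defined as follows:

\[
H_{j,i} = \sum_{x,y,t} I\{ f_j(x, y, t) = i \}, \qquad i = 0, \ldots, n_j - 1; \quad j = 0, 1, 2
\]

where the sum runs over the pixels of one block volume, $f_j(x, y, t)$ is the LBP code of the pixel at
$(x, y, t)$ in the $j$th plane ($j = 0$: XY, $1$: XT, $2$: YT), $n_j$ is the number of different labels
produced by the LBP operator in that plane, and $I\{A\}$ equals 1 if $A$ is true and 0 otherwise.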
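The effect of dividing the sequence in time order can be sketched as follows. The sketch assumes the LBP codes of every pixel have already been computed for each of the three planes, and it uses contiguous, non-overlapping blocks for brevity, whereas the text above describes overlapping blocks. The names `block_bounds` and `mouth_movement_histogram` and the toy two-frame volumes are illustrative assumptions.

```python
def block_bounds(size, parts):
    """Split range(size) into `parts` contiguous, non-overlapping intervals."""
    return [(k * size // parts, (k + 1) * size // parts) for k in range(parts)]


def mouth_movement_histogram(code_volumes, n_labels, blocks):
    """Concatenate the per-block, per-plane LBP histograms of one sequence.

    code_volumes: three volumes (XY, XT, YT), each indexed [t][y][x] and
                  holding the precomputed LBP code of every pixel.
    n_labels:     number of distinct codes n_j for each plane j.
    blocks:       (rows, cols, time_segments) of the block grid.
    """
    rows, cols, segs = blocks
    depth = len(code_volumes[0])
    height = len(code_volumes[0][0])
    width = len(code_volumes[0][0][0])
    feature = []
    for t0, t1 in block_bounds(depth, segs):              # time order
        for y0, y1 in block_bounds(height, rows):         # spatial rows
            for x0, x1 in block_bounds(width, cols):      # spatial columns
                for j, codes in enumerate(code_volumes):  # XY, XT, YT
                    hist = [0] * n_labels[j]
                    for t in range(t0, t1):
                        for y in range(y0, y1):
                            for x in range(x0, x1):
                                hist[codes[t][y][x]] += 1
                    feature.extend(hist)
    return feature


# Two toy "utterances" containing the same frames in opposite order.
vol_a = [[[0, 0], [0, 0]], [[1, 1], [1, 1]]]
vol_b = [[[1, 1], [1, 1]], [[0, 0], [0, 0]]]
whole_a = mouth_movement_histogram([vol_a] * 3, (2, 2, 2), (1, 1, 1))
whole_b = mouth_movement_histogram([vol_b] * 3, (2, 2, 2), (1, 1, 1))
split_a = mouth_movement_histogram([vol_a] * 3, (2, 2, 2), (1, 1, 2))
split_b = mouth_movement_histogram([vol_b] * 3, (2, 2, 2), (1, 1, 2))
print(whole_a == whole_b, split_a == split_b)  # prints True False
```

With a single time segment the two toy sequences are indistinguishable, which mirrors the "you-see" versus "see-you" problem; with two time segments the concatenated histograms differ.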



III. Multiresolution Features and Feature Selection 



Multiresolution features provide more accurate information and improve the analysis of dynamic events.
However, using multiresolution features greatly increases the number of features. If the features from
the different resolutions were concatenated in series directly, the feature vector would be very long and
the computational complexity would be too high. Clearly, not all multiresolution features contribute
equally, so it is necessary to find out which features matter most: at which location, with what
resolution, and, more importantly, of which type (appearance, horizontal motion, or vertical motion). We
need feature selection for this purpose. By changing the parameters, three different types of
spatiotemporal resolution are provided: 1) use of a different number of neighboring points when computing
the features in the XY (appearance), XT (horizontal motion), and YT (vertical motion) slices; 2) use of
different radii that can capture the occurrences at different space and time scales; 3) use of blocks of
different sizes to create global and local statistical features [4] [5].
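To give a feel for how quickly the candidate pool grows, the sketch below enumerates one candidate slice per combination of the three resolution types. All parameter values are hypothetical and chosen only for illustration (the 1x5x3 grid is the block layout reported in Table I); the name `candidate_slices` is likewise an assumption.

```python
from itertools import product

# Hypothetical settings, one list per type of resolution.
neighbor_counts = [4, 8]                # 1) number of neighboring points P
radii = [1, 3]                          # 2) radius R in space and time
block_grids = [(1, 5, 3), (2, 5, 3)]    # 3) (rows, cols, time segments)
planes = ["XY", "XT", "YT"]             # appearance, horizontal, vertical motion


def candidate_slices():
    """List every (P, R, grid, block, plane) slice a selector could pick from."""
    slices = []
    for p, r, (rows, cols, segs) in product(neighbor_counts, radii, block_grids):
        for row, col, seg, plane in product(range(rows), range(cols),
                                            range(segs), planes):
            slices.append({"P": p, "R": r, "grid": (rows, cols, segs),
                           "block": (row, col, seg), "plane": plane})
    return slices


print(len(candidate_slices()))  # prints 540
```

Even these few settings yield 540 slice histograms, which is why a selection step that keeps only the most discriminative slices is needed before classification.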

IV. Our System



Our system consists of three stages, as shown in Figure 6. The first stage is the detection of lip
movements. The second stage extracts the visual features from the mouth movement sequence. The role of
the final stage is to recognize the input utterance using a KNN classifier.
Figure 6. System diagram
In our approach, a webcam is used to capture the lip movements. We then extract the visual features from
the mouth region. These features are then fed to the classifier. For speech recognition, a KNN classifier
is selected since it is well founded in statistical learning theory and has been successfully applied to
various object detection tasks in computer vision. Since the KNN is only used here for separating two sets
of points, the multi-phrase classification problem is decomposed into two-class problems, and a voting
scheme is then used to accomplish recognition. Sometimes more than one class gets the highest number of
votes; in this case, 1-NN template matching is applied to these classes to reach the final result. For
this purpose, in training, the spatiotemporal LBP histograms of utterance sequences belonging to a given
class are averaged to generate a histogram template for that class [1].
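The voting scheme with its template-matching tie-break can be sketched as below. This is an illustration under stated assumptions, not the authors' implementation: the squared Euclidean distance, the default k = 1, and the names `knn_pair`, `class_templates`, and `recognize` are all choices made for the example, and the toy histograms are invented.

```python
from collections import Counter
from itertools import combinations


def distance(a, b):
    """Squared Euclidean distance between two histograms (an assumed metric)."""
    return sum((u - v) ** 2 for u, v in zip(a, b))


def class_templates(train):
    """Average the training histograms of each class into one template."""
    return {label: [sum(col) / len(samples) for col in zip(*samples)]
            for label, samples in train.items()}


def knn_pair(query, train, a, b, k=1):
    """Two-class k-NN decision restricted to the samples of classes a and b."""
    pool = [(distance(query, s), label) for label in (a, b) for s in train[label]]
    pool.sort(key=lambda item: item[0])
    return Counter(label for _, label in pool[:k]).most_common(1)[0][0]


def recognize(query, train, k=1):
    """Vote over all class pairs; break ties with 1-NN template matching."""
    votes = Counter(knn_pair(query, train, a, b, k)
                    for a, b in combinations(sorted(train), 2))
    best = max(votes.values())
    tied = [label for label, count in votes.items() if count == best]
    if len(tied) == 1:
        return tied[0]
    templates = class_templates(train)
    return min(tied, key=lambda label: distance(query, templates[label]))


train = {
    "hello": [[4, 0, 0], [5, 1, 0]],
    "see you": [[0, 4, 0], [1, 5, 0]],
    "thank you": [[0, 0, 4], [0, 1, 5]],
}
print(recognize([4, 1, 0], train))  # prints hello
```

An odd k keeps each two-class decision free of ties, so the template step is only reached when several classes collect the same number of pairwise votes.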



V. Experimental Protocol and Results



For a more accurate evaluation of our proposed method, we designed different experiments, including
speaker-independent, speaker-dependent, and multiresolution experiments.



1) Speaker-Independent Experiments: For the speaker-independent experiments, leave-one-speaker-out is
utilized. We trained on a particular phrase using one speaker, then had 10 different speakers say the same
phrase; in testing, the phrase was recognized correctly for 6 of the speakers. The overall results were
obtained using M/N (M is the total number of correctly recognized sequences and N is the total number of
testing sequences). When extracting the local patterns, we take into account not only the locations of
micro-patterns but also the time order of lip movements, so the whole sequence is divided into block
volumes according to not only spatial regions but also time order. Figure 7 demonstrates the performance
for every speaker. The results from the second speaker are the worst, mainly because the big moustache of
that speaker really influences the appearance and motion in the mouth region.
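The leave-one-speaker-out protocol and the M/N accuracy can be sketched as follows. The sample layout, the nearest-neighbor helper `nearest_label`, and the toy data are assumptions made for illustration; they are not the recordings used in the experiments.

```python
def nearest_label(train, feature):
    """1-NN with squared Euclidean distance (an assumed stand-in classifier)."""
    best = min(train, key=lambda item: sum((a - b) ** 2
                                           for a, b in zip(item[1], feature)))
    return best[0]


def leave_one_speaker_out(samples, classify):
    """samples: list of (speaker, label, feature) triples.

    Each speaker is held out in turn; the classifier is given only the other
    speakers' samples. Returns the overall accuracy M/N and per-speaker rates.
    """
    speakers = sorted({spk for spk, _, _ in samples})
    correct = total = 0
    per_speaker = {}
    for held_out in speakers:
        train = [(lab, feat) for spk, lab, feat in samples if spk != held_out]
        test = [(lab, feat) for spk, lab, feat in samples if spk == held_out]
        hits = sum(1 for lab, feat in test if classify(train, feat) == lab)
        per_speaker[held_out] = hits / len(test)
        correct += hits          # contributes to M
        total += len(test)       # contributes to N
    return correct / total, per_speaker


samples = [
    ("s1", "hello", [10, 0]), ("s1", "bye", [0, 10]),
    ("s2", "hello", [9, 1]),  ("s2", "bye", [1, 9]),
    ("s3", "hello", [3, 7]),  ("s3", "bye", [7, 3]),  # atypical speaker
]
accuracy, per_speaker = leave_one_speaker_out(samples, nearest_label)
print(per_speaker)  # prints {'s1': 1.0, 's2': 1.0, 's3': 0.0}
```

In the toy data the third speaker articulates differently from the other two, so that speaker's utterances are misrecognized when held out, much as an atypical mouth appearance lowers one speaker's score.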




Figure 7. Recognition performance for every speaker

2) Speaker-Dependent Experiments: For speaker-dependent experiments, leave-one-utterance-out is utilized
for cross validation. In our approach, when the system is trained on a list of phrases from a particular
speaker and then tested with the same speaker, 70% of the same phrases were identified correctly.

3) One-to-One versus One-to-Rest Recognition: In the previous experiments on our own dataset, the
ten-phrase classification problem is decomposed into 45 two-class problems ("Hello"-"See you",
"I am sorry"-"Thank you", "You are welcome"-"Have a good time", etc.). But with this multiple two-class
strategy, the number of classifiers grows quadratically with the number of classes to be recognized, as in
the AVLetters database. When the class number is N, the number of KNN classifiers would be N(N-1)/2.
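The classifier counts can be checked with a few lines. The sketch assumes that a one-to-rest strategy needs one classifier per class; the function names are illustrative, and only the six phrases named in the text are listed.

```python
from itertools import combinations


def one_to_one_count(n_classes):
    """Number of two-class problems: N(N-1)/2."""
    return n_classes * (n_classes - 1) // 2


def one_to_rest_count(n_classes):
    """One classifier per class (each class against all the others)."""
    return n_classes


named_phrases = ["Hello", "See you", "I am sorry",
                 "Thank you", "You are welcome", "Have a good time"]
pairs = list(combinations(named_phrases, 2))

print(one_to_one_count(10), one_to_rest_count(10), len(pairs))  # prints 45 10 15
```

For ten phrases the pairwise strategy needs 45 classifiers against 10 for one-to-rest, and the gap widens as the vocabulary grows.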




Figure 8. Selected 15 slices for phrases "See you" and "Thank you". "|" in the blocks means the YT slice
(vertical motion) is selected, "-" the XT slice (horizontal motion), and "/" the appearance XY slice [1].
Figure 8 shows the selected slices for the similar phrases "see you" and "thank you". These phrases were
the most difficult to recognize because they are quite similar in the latter part, which contains the same
word "you". The selected slices are mainly in the first and second parts of the phrase; just one vertical
slice is from the last part. The selected features are consistent with human intuition.

Table I

Results from One-to-One and One-to-Rest Classifiers on Semi-Speaker-Dependent Experiments (Results in
the Parentheses are From One-to-Rest Strategy)

Features                         Blocks   Third-test (O-R)   Three-fold (O-R)
LBP-TOP (subscript illegible)    1x5x3    58.92              63.45 (illegible)
LBP-TOP (subscript illegible)    1x5x3    62.01              67.23 (illegible)



Conclusion 

A real-time capture of lip movements using a webcam for recognition of speech is proposed, in order to
help speech-disabled persons. Our approach uses local spatiotemporal descriptors for the recognition of
the input utterance. LBP-TOP is used to extract the features from the captured images. Experiments were
made on ten speakers, and the lip movements were converted into speech very efficiently. Compared to other
approaches, our method outperforms the others with an accuracy of 70%.